Data Pre-processing

Extracting Video IDs of TED Talks from YouTube

Data Cleaning

Non-speech sounds and events

Examples of parentheticals (non-speech sounds)

(Applause), (Applause ends), (Pre-recorded applause), (Pre-recorded applause and cheering), (Audience cheers), (Laughter), (Shouting), (Mock sob), (Breathes in), (Baby cooing), (Video), (Singing), (Heroic music), (Loud music), (Music), (Music ends), (Plays notes), (Sighs), (Clears throat), (Whispering)
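One possible way to strip parentheticals like these from a transcript is a regular expression. This is a minimal sketch, not necessarily the method used in this project. It assumes that every non-speech annotation is a short parenthetical starting with a capital letter and containing no nested parentheses. The names `PARENTHETICAL` and `strip_parentheticals` are hypothetical.

```python
import re

# Assumption: annotations look like "(Applause)" or "(Music ends)",
# i.e. they start with a capital letter and have no nested parentheses.
PARENTHETICAL = re.compile(r"\([A-Z][^()]*\)")


def strip_parentheticals(text: str) -> str:
    """Remove parenthetical annotations and collapse leftover whitespace."""
    without_annotations = PARENTHETICAL.sub(" ", text)
    return re.sub(r"\s+", " ", without_annotations).strip()


sample = "Thank you. (Applause) So I was saying (Laughter) that music (Music ends) matters."
cleaned = strip_parentheticals(sample)
print(cleaned)
```

Note that this pattern would also remove any spoken aside that the transcriber wrote in parentheses with a leading capital, so a fixed list of known annotations may be safer if such asides occur in the data.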

Four important steps for cleaning the text and getting it into a format that we can analyze:

Vectorization

Vectorization is the important step of turning our words into numbers. The vectorizer takes each word in each document and counts the number of times the word appears. The result is a document-term matrix: each column is a word and each row is a document (talk), so each entry is the frequency of a word in a document. Because most words do not appear in most talks, the majority of entries are zero, and the matrix is usually stored as a sparse matrix.

Vectorization will come into use with Topic Modelling.
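To make the idea of counting words per document concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the project's implementation: the function name `count_vectorize` and the two sample talks are hypothetical, tokenization is a simple lowercase whitespace split, and the result is a dense list of lists rather than a sparse matrix. In practice a library class such as scikit-learn's `CountVectorizer` would typically be used, and it returns a sparse matrix.

```python
from collections import Counter


def count_vectorize(documents):
    """Build a vocabulary and a document-term count matrix.

    Rows are documents, columns are words in sorted vocabulary order.
    """
    # Simplistic tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical sample data standing in for cleaned talk transcripts.
talks = ["ideas worth spreading", "spreading ideas and more ideas"]
vocabulary, matrix = count_vectorize(talks)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` corresponds to one talk and each column to one word of `vocabulary`. With a real corpus most entries would be zero, which is why a sparse representation is preferred.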

Exploratory Data Analysis